This is a temporary project that exists only to complete mandatory classwork for the Open Data Science class at the University of Helsinki. I expect to hide or remove this repository after December 2020.
This repository is for an introductory open data science class for PhD students at the University of Helsinki (with some students from other universities). The course is open to everyone, uses open source tools, and emphasizes open science. The statistics part seems fairly elementary, but I could use a brush-up, which is why I signed up for the course.
The course seems to cover GitHub, R, RStudio, linear and logistic regression, dimensionality reduction, clustering, hypothesis testing, and some other basic multivariate statistical analysis tools.
Links to my repositories and web pages for the class:
Here is code and output from data/create_learning2014.R:
# Tatu Ylönen / University of Helsinki
# 1.11.2020
# RStudio Exercise 2 - data wrangling
# This uses the data set described in:
# Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183
# Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014
# (Introduction to Social Statistics, fall 2014 - in Finnish),
# international survey of Approaches to Learning, made possible
# by Teachers' Academy funding for KV in 2013-2015.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the data into a data frame.
data <- read.csv(file="http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t")
# Print the dimensions of the data.
print("=== dimensions of the data JYTOPKYS3-data dataset")
## [1] "=== dimensions of the data JYTOPKYS3-data dataset"
dim(data)
## [1] 183 60
# Print a summary of what the data frame contains.
# Turns out there are 183 observations for 60 variables.
print("Structure of the JYTOPKYS3-data dataset")
## [1] "Structure of the JYTOPKYS3-data dataset"
str(data)
## 'data.frame': 183 obs. of 60 variables:
## $ Aa : int 3 2 4 4 3 4 4 3 2 3 ...
## $ Ab : int 1 2 1 2 2 2 1 1 1 2 ...
## $ Ac : int 2 2 1 3 2 1 2 2 2 1 ...
## $ Ad : int 1 2 1 2 1 1 2 1 1 1 ...
## $ Ae : int 1 1 1 1 2 1 1 1 1 1 ...
## $ Af : int 1 1 1 1 1 1 1 1 1 2 ...
## $ ST01 : int 4 4 3 3 4 4 5 4 4 4 ...
## $ SU02 : int 2 2 1 3 2 3 2 2 1 2 ...
## $ D03 : int 4 4 4 4 5 5 4 4 5 4 ...
## $ ST04 : int 4 4 4 4 3 4 2 5 5 4 ...
## $ SU05 : int 2 4 2 3 4 3 2 4 2 4 ...
## $ D06 : int 4 2 3 4 4 5 3 3 4 4 ...
## $ D07 : int 4 3 4 4 4 5 4 4 5 4 ...
## $ SU08 : int 3 4 1 2 3 4 4 2 4 2 ...
## $ ST09 : int 3 4 3 3 4 4 2 4 4 4 ...
## $ SU10 : int 2 1 1 1 2 1 1 2 1 2 ...
## $ D11 : int 3 4 4 3 4 5 5 3 4 4 ...
## $ ST12 : int 3 1 4 3 2 3 2 4 4 4 ...
## $ SU13 : int 3 3 2 2 3 1 1 2 1 2 ...
## $ D14 : int 4 2 4 4 4 5 5 4 4 4 ...
## $ D15 : int 3 3 2 3 3 4 2 2 3 4 ...
## $ SU16 : int 2 4 3 2 3 2 3 3 4 4 ...
## $ ST17 : int 3 4 3 3 4 3 4 3 4 4 ...
## $ SU18 : int 2 2 1 1 1 2 1 2 1 2 ...
## $ D19 : int 4 3 4 3 4 4 4 4 5 4 ...
## $ ST20 : int 2 1 3 3 3 3 1 4 4 2 ...
## $ SU21 : int 3 2 2 3 2 4 1 3 2 4 ...
## $ D22 : int 3 2 4 3 3 5 4 2 4 4 ...
## $ D23 : int 2 3 3 3 3 4 3 2 4 4 ...
## $ SU24 : int 2 4 3 2 4 2 2 4 2 4 ...
## $ ST25 : int 4 2 4 3 4 4 1 4 4 4 ...
## $ SU26 : int 4 4 4 2 3 2 1 4 4 4 ...
## $ D27 : int 4 2 3 3 3 5 4 4 5 4 ...
## $ ST28 : int 4 2 5 3 5 4 1 4 5 2 ...
## $ SU29 : int 3 3 2 3 3 2 1 2 1 2 ...
## $ D30 : int 4 3 4 4 3 5 4 3 4 4 ...
## $ D31 : int 4 4 3 4 4 5 4 4 5 4 ...
## $ SU32 : int 3 5 5 3 4 3 4 4 3 4 ...
## $ Ca : int 2 4 3 3 2 3 4 2 3 2 ...
## $ Cb : int 4 4 5 4 4 5 5 4 5 4 ...
## $ Cc : int 3 4 4 4 4 4 4 4 4 4 ...
## $ Cd : int 4 5 4 4 3 4 4 5 5 5 ...
## $ Ce : int 3 5 3 3 3 3 4 3 3 4 ...
## $ Cf : int 2 3 4 4 3 4 5 3 3 4 ...
## $ Cg : int 3 2 4 4 4 5 5 3 5 4 ...
## $ Ch : int 4 4 2 3 4 4 3 3 5 4 ...
## $ Da : int 3 4 1 2 3 3 2 2 4 1 ...
## $ Db : int 4 3 4 4 4 5 4 4 2 4 ...
## $ Dc : int 4 3 4 5 4 4 4 4 4 4 ...
## $ Dd : int 5 4 1 2 4 4 5 3 5 2 ...
## $ De : int 4 3 4 5 4 4 5 4 4 2 ...
## $ Df : int 2 2 1 1 2 3 1 1 4 1 ...
## $ Dg : int 4 3 3 5 5 4 4 4 5 1 ...
## $ Dh : int 3 3 1 4 5 3 4 1 4 1 ...
## $ Di : int 4 2 1 2 3 3 2 1 4 1 ...
## $ Dj : int 4 4 5 5 3 5 4 5 2 4 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
# Create the analysis dataset with variables gender, age, attitude,
# deep, stra, surf and points. First copy the data wrangling statements from
# the datacamp exercises.
data$attitude = data$Attitude / 10
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30",
"D06", "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21",
"SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20",
"ST28")
deep_columns <- select(data, one_of(deep_questions))
data$deep <- rowMeans(deep_columns)
surface_columns <- select(data, one_of(surface_questions))
data$surf <- rowMeans(surface_columns)
strategic_columns <- select(data, one_of(strategic_questions))
data$stra = rowMeans(strategic_columns)
# Select only the desired fields for the analysis data set
anal <- select(data, c("gender", "Age", "attitude", "deep", "stra", "surf",
"Points"))
colnames(anal)[2] <- "age"
colnames(anal)[7] <- "points"
# Exclude observations where the exam points variable is zero from the
# analysis data set
anal <- filter(anal, points != 0)
# Show the structure of the analysis dataset
print("=== analysis dataset")
## [1] "=== analysis dataset"
str(anal)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
# Write the analysis data set into a file
write.csv(anal, "data/analysis-dataset.csv")
# Demonstrate that we can read back the analysis data set
readback <- read.csv("data/analysis-dataset.csv")
print("=== structure of the readback dataset")
## [1] "=== structure of the readback dataset"
str(readback)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
print("=== head of the readback dataset")
## [1] "=== head of the readback dataset"
head(readback)
## X gender age attitude deep stra surf points
## 1 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 6 F 38 3.8 4.750000 3.625 2.416667 21
The objective of the study was to analyze how various factors influence students’ skills in statistics.
In this exercise we analyzed the dataset described in Kimmo Vehkalahti: The Relationship between Learning Approaches and Students’ Achievements in an Introductory Statistics Course in Finland, 60th World Statistics Congress (ISI2015), Rio de Janeiro, Brazil, July 2015. The overall approach is covered more deeply in Tait, Entwistle and McKune: ASSIST: The Approaches and Study Skills Inventory for Students, 1998.
The data set was collected to understand how various factors influence students’ statistics skills. It includes data for 183 students. A total of 60 variables were collected. The variables can be grouped into those relating to the student’s background and scholastic achievements, deep questions (seeking meaning, relating ideas, use of evidence), surface questions (lack of purpose, unrelated memorizing, syllabus-boundedness), and strategic questions (organized studying, time management).
The student’s exam points are taken as the measure of skill that we try to explain using the other variables. For this analysis, we chose linear least squares regression with the student’s attitude and the averaged answers to the surface and strategic questions as explanatory variables. We chose not to include the deep questions, age, or gender as explanatory variables because the assignment instructions said to use three variables; extending the analysis to additional variables would be straightforward.
We additionally assess the significance of the results using the t statistic, including a look at whether the distribution of the regression errors is approximately normal, to verify that the chosen significance testing methods are applicable in this situation.
The analysis was performed using R (version 3.6.3). The R scripts and their output are included below.
The original data set can be found at the URL used in the script above; its detailed description is available from the same site.
We preprocessed the data by averaging the answers to the deep questions (12 items), surface questions (12 items), and strategic questions (8 items) using row means. A sum of the attitude answers had already been precomputed, so it was simply divided by the number of items (10). Additionally, the student’s exam points, general attitude towards statistics, age, and gender were carried over into the analysis data set.
The data preprocessing was done using the script built for the data wrangling exercise, data/create_learning2014.R, shown above.
Observations where the exam points are zero were excluded from the analysis. This left 166 observations.
To get a quick overview of the data and the relationships between points and the possible explanatory variables, we first plot points against each variable.
library(ggplot2)
data <- read.csv("data/analysis-dataset.csv")
p <- ggplot(data, aes(x=deep, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=surf, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=stra, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=attitude, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=age, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
p <- ggplot(data, aes(x=gender, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'
It is clear that gender, as a two-category variable, is not suitable for linear regression. The data also contains relatively few observations with age above about 27, so age is probably not a very helpful explanatory variable either. The deep-question averages are mostly concentrated between 3 and 5, so deep might be less useful for regression than the other variables; the regression line also suggests it has little influence on points. These observations contributed to choosing attitude, surf, and stra for our analysis.
First, we compute a linear least squares regression, trying to explain points using the students’ attitude and the averaged answers for the surface and strategic questions.
library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude + surf + stra, data=data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + surf + stra, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
For statistically significant results we would expect p < 0.05. The estimated fit shows a positive coefficient for attitude (3.40, p = 1.93e-08, which is statistically significant). It also shows a negative coefficient for the surface questions and a positive coefficient for the strategic questions, but these are not statistically significant (p = 0.47 and p = 0.12, respectively).
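Rather than reading the numbers off the printout, the coefficient table can also be pulled out of the model object programmatically. A minimal sketch, assuming the `model` object fitted above is still in scope:

```r
# The coefficient table of a fitted lm: one row per term, with columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coefs <- summary(model)$coefficients
# List the terms that are significant at the 0.05 level.
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
```

With the model above, this should list only the intercept and attitude.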
The t statistic used for significance testing in this linear regression assumes that the distribution of errors follows the normal distribution. To validate this assumption, we first plot the residuals.
library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude + surf + stra, data=data)
plot(model)
The residuals vs. fitted plot shows no obvious pattern that would invalidate the p values. For a normal distribution we would expect the highest concentration of residuals around zero, with the residuals independent of the fitted value. There is a small cluster of negative residuals at fitted values around 24-27, and at extreme fitted values the residuals tend to be negative. However, this effect seems small enough not to cause significant concern.
From the Q-Q plot we would expect the residual distribution to match the theoretical quantiles. We can see that while the regression seems to work fairly well for most data points, extreme errors on both sides tend to be more negative than a normal distribution would predict. However, the deviation is not very large. Overall, the plot looks sufficiently consistent with an approximately normal distribution to consider the p values reasonably reliable, but some caution would be warranted.
Finally, the residuals vs. leverage plot for a normal distribution should show the same spread regardless of leverage and the standardized residual density should follow a normal distribution. This appears to hold reasonably well in this case, though it is also possible that the spread decreases with leverage. There are too few data points at high leverage to be sure. Generally I would say the plot is consistent with the assumption of an approximately normal distribution of residuals.
We will now remove the variables that are not statistically significant from the regression. Thus, we compute the regression against just attitude.
library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude, data=data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
plot(model)
There is little change in the coefficient of attitude compared to the previous regression (now 3.53, was 3.40). Its statistical significance has slightly improved (now p = 4.12e-09, which is highly significant). The intercept 11.64 is also highly significant (p = 1.95e-09).
Looking at the residuals vs. fitted plot, the residuals now look like a better match to a normal distribution, though there seem to be fewer samples further out than expected. The Q-Q plot has not changed much, or perhaps the negative deviation at the extremes has become stronger.
Given the very high significance from the t test and the approximate conformance with the normal distribution, it seems likely that the result for attitude truly is significant. However, I would not trust the exact p value, as there is some deviation from normality. I would seek to confirm the p value using nonparametric tests before submitting to a peer-reviewed scientific publication.
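As one quick additional check (not part of the original analysis), the Shapiro-Wilk test gives a formal test of residual normality to complement the graphical diagnostics. A sketch assuming the `model` object from above:

```r
# Shapiro-Wilk tests the null hypothesis that a sample is normally
# distributed; a small p-value would argue against relying on the
# t-based p-values above.
shapiro.test(residuals(model))
```

This complements, rather than replaces, the residual plots, since it says nothing about where the deviation occurs.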
Finally, we try to analyze the correlation and explanatory power of attitude on points. For this, we use the Pearson correlation coefficient and R^2 as the measure of explanatory power. This assumes an approximately normal distribution.
library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
r = cor(data$attitude, data$points, method="pearson")
print(r)
## [1] 0.4365245
cat("correlation R", r, "\n")
## correlation R 0.4365245
cat("R^2", r * r, "\n")
## R^2 0.1905537
The R^2 value of 0.19 suggests that attitude explains about 19% of the variance in points.
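For a single-predictor linear model, the squared Pearson correlation equals the Multiple R-squared reported by summary(). A sketch to confirm the two numbers agree, assuming `data` and the attitude-only `model` from above:

```r
# In simple linear regression, R^2 is exactly the squared Pearson
# correlation between the predictor and the response.
r <- cor(data$attitude, data$points, method="pearson")
all.equal(r^2, summary(model)$r.squared)
```

This should return TRUE, matching the Multiple R-squared of 0.1906 in the regression summary.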
The student’s attitude toward statistics seems to have a statistically significant relationship with the student’s points on the exam that can be approximated with a linear model: \[ points = 3.53 \cdot attitude + 11.64 \]
This model explains about 19% of the variance in points. About 81% of variance is not explained by this model.
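To make the fitted equation concrete, we can plug in a value by hand. A small sketch using the rounded coefficients from the text and the first observation (attitude 3.7, observed points 25):

```r
# Predicted exam points for a student with attitude 3.7, using the
# rounded coefficients of the fitted model above.
3.53 * 3.7 + 11.64   # about 24.7, close to the observed 25
```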
It should be noted that we have not analyzed whether higher points are caused by attitude or vice versa, or whether the results could be better explained by some other common cause. We have merely detected a correlation.
See the script create_alc.R in the data directory on GitHub.
In this exercise we analyzed how students’ alcohol usage can be predicted using other variables in the data set. The data set is the Student Performance Data Set from the Machine Learning Repository at UC Irvine. The dataset is described in P. Cortez and A. Silva: Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira (eds.): Proceedings of the 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008.
First, let’s look at an overview of the data. Looking at available field names, we select a few fields as potential candidates for further investigation.
library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
alc <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt")
# Print column names
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
# Print overall structure of the data
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
The data includes numerous factors describing the students’ background and lifestyle, as well as problems with their studies, such as failed classes or absences, and their alcohol usage. Our task is to study the relationship between high/low alcohol consumption and the other data. In total, the data set includes 382 observations of 35 variables.
Looking at the data, I decided to choose the following variables for closer inspection: Medu (mother’s education), Fedu (father’s education), goout (going out with friends), and famrel (quality of family relationships).
Let’s now look at them graphically using scatterplots (with some position jitter to make the point density more visible, given the discrete values):
ggplot(alc, aes(x=Medu, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(alc, aes(x=Fedu, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(alc, aes(x=goout, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(alc, aes(x=famrel, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
Let’s also look at the data numerically using covariances.
# Covariances help us understand the relationship between different variables.
sub <- select(alc, one_of(c("alc_use", "Medu", "Fedu", "goout", "famrel")))
cov(sub, sub)
## alc_use Medu Fedu goout famrel
## alc_use 0.975635899 0.01153619 0.009385607 0.43383353 -0.10978961
## Medu 0.011536189 1.18022289 0.773865963 0.08223056 0.01192783
## Fedu 0.009385607 0.77386596 1.201742452 0.04117025 0.01313710
## goout 0.433833533 0.08223056 0.041170246 1.28125902 0.08291077
## famrel -0.109789614 0.01192783 0.013137101 0.08291077 0.84938368
We can see that alc_use has a covariance of 0.43 with goout and -0.11 with famrel, while the covariances with Medu and Fedu are small.
Visually and based on the linear regression line, it does not look like parents’ education has much impact on alcohol use. Nevertheless, they might have an impact when conditioning on other variables or after eliminating the impact of other variables, so let’s keep them in the analysis. Going out and family relationships, however, show a clear impact.
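Covariances depend on the scales of the variables, so their magnitudes are hard to compare across variables; correlations rescale them to [-1, 1]. A sketch assuming the `sub` data frame from above:

```r
# Correlations are covariances divided by both standard deviations,
# making the strengths of the relationships directly comparable.
cor(sub)
```

For example, the covariance of 0.43 between alc_use and goout corresponds to a correlation of about 0.43 / sqrt(0.98 * 1.28) ≈ 0.39.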
Let’s now focus on high alcohol use, i.e., the binarized variable high_use, and use logistic regression to analyze the impact of the selected variables on it. We load the boot library here because we will also need it later for cross-validation.
library(boot)
m <- glm(high_use ~ Medu + Fedu + goout + famrel, data=alc, family="binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ Medu + Fedu + goout + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6349 -0.7725 -0.5598 0.9912 2.4030
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.73933 0.68188 -2.551 0.01075 *
## Medu -0.10465 0.14883 -0.703 0.48194
## Fedu 0.05742 0.14751 0.389 0.69707
## goout 0.80690 0.11928 6.765 1.33e-11 ***
## famrel -0.42459 0.13316 -3.189 0.00143 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 401.82 on 377 degrees of freedom
## AIC: 411.82
##
## Number of Fisher Scoring iterations: 4
Looking at the significance level of the coefficients, we can see that Medu and Fedu are not statistically significant, while goout and famrel are statistically significant at the p=0.01 level. Let’s run the regression again but without Medu and Fedu.
library(boot)
m <- glm(high_use ~ goout + famrel, data=alc, family="binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ goout + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5586 -0.7385 -0.5676 0.9969 2.3758
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8742 0.6015 -3.116 0.00183 **
## goout 0.8004 0.1187 6.743 1.55e-11 ***
## famrel -0.4218 0.1330 -3.171 0.00152 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 402.32 on 379 degrees of freedom
## AIC: 408.32
##
## Number of Fisher Scoring iterations: 4
We can see that goout and famrel are still statistically significant at the p=0.01 level, with coefficients of 0.80 and -0.42, respectively. Their standard errors are at most about one third of the absolute values of the estimates.
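Logistic regression has no R^2, but the deviances in the summary give a rough analogue, McFadden's pseudo-R^2 (this is an extra check, not part of the original analysis). A sketch assuming the fitted model `m` from above:

```r
# McFadden's pseudo-R^2: 1 - residual deviance / null deviance.
1 - m$deviance / m$null.deviance
# about 1 - 402.32 / 462.21 = 0.13 for the model above
```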
Let’s now look at the coefficients and their confidence intervals after exponentiation. Exponentiation converts additive effects on the log-odds into multiplicative effects on the odds: exponentiated coefficients >1 indicate a positive impact on high_use, while coefficients <1 indicate a negative impact.
OR <- coef(m) %>% exp
OR
## (Intercept) goout famrel
## 0.1534844 2.2264964 0.6558574
CI <- exp(confint(m))
## Waiting for profiling to be done...
CI
## 2.5 % 97.5 %
## (Intercept) 0.04577266 0.4877710
## goout 1.77580201 2.8309109
## famrel 0.50355350 0.8498297
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1534844 0.04577266 0.4877710
## goout 2.2264964 1.77580201 2.8309109
## famrel 0.6558574 0.50355350 0.8498297
We can see that the results are in line with our hypothesis and with the non-exponentiated coefficients (i.e., a negative coefficient there corresponds to a value below 1 after exponentiation).
The exponentiated coefficients relate directly to the odds ratios. In this case, a one-point increase in goout multiplies the odds of high use by 2.2, and a one-point increase in famrel multiplies them by 0.66, i.e., decreases them. The confidence intervals of these odds ratios are given directly by the confidence intervals of the exponentiated coefficients, as reported by exp(confint(m)) above.
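The odds-ratio arithmetic can be worked out by hand. A sketch using the rounded goout coefficient from the model above:

```r
# A one-point increase in goout multiplies the odds of high use by
# exp(coefficient); a two-point increase squares that factor.
exp(0.8004)       # about 2.23, matching the odds ratio reported above
exp(0.8004 * 2)   # about 4.96 for a two-point increase
```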
Let’s now explore the predictive power of the model.
# Loss function: the proportion of observations where the predicted
# probability falls on the wrong side of 0.5, i.e., the misclassification rate.
loss_func <- function(class, prob) {
wrong <- abs(class - prob) > 0.5
mean(wrong)
}
# Predict the probability of high_use using the model
probabilities <- predict(m, type="response")
alc <- mutate(alc, probability=probabilities)
alc <- mutate(alc, prediction=probability >= 0.5)
# Let's look at ten samples and how they were predicted vs. actual high_use
select(alc, high_use, prediction, Medu, Fedu, goout, famrel) %>% tail(10)
## high_use prediction Medu Fedu goout famrel
## 373 FALSE FALSE 1 1 2 4
## 374 TRUE FALSE 4 2 3 5
## 375 FALSE FALSE 2 2 3 5
## 376 FALSE FALSE 4 4 3 4
## 377 FALSE FALSE 2 3 2 5
## 378 FALSE FALSE 3 1 4 4
## 379 FALSE FALSE 1 1 1 1
## 380 FALSE FALSE 1 1 1 1
## 381 TRUE TRUE 3 1 5 2
## 382 TRUE FALSE 3 2 1 4
# Let's also create the confusion matrix for the prediction
table(high_use=alc$high_use, prediction=alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 243 27
## TRUE 64 48
Based on the ten samples, we see 8 correct predictions and two Type II errors (actual high use predicted as low). Looks reasonable. The confusion matrix shows that the prediction missed 64 cases of high alcohol use, correctly identified 48, and falsely flagged 27 low-use students as high use.
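The overall training error can be computed directly from the confusion matrix counts above:

```r
# Misclassified students: 27 false positives + 64 false negatives,
# out of 382 students in total.
(27 + 64) / 382    # training error of about 0.24
# Correspondingly, the training accuracy:
(243 + 48) / 382   # about 0.76
```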
Let’s also look at a scatterplot of the predictions vs. actual values (though this is not very useful as we only have four possible combinations of values, a plot with position jitter will still show something).
ggplot(alc, aes(x=prediction, y=high_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5))
The plot supports our earlier analysis of the confusion matrix.
Let’s now use 10-fold cross-validation to estimate how well the model generalizes beyond the data it was fitted on. (The loss function was already defined above.)
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m, K = 10)
# average proportion of wrong predictions in the cross-validation, i.e., the prediction error
cv$delta[1]
## [1] 0.2539267
The cross validation indicates an error of 0.25 for my model. This is better than the error of the model introduced in DataCamp (0.26). (Note, however, that due to the random nature of cross-validation, and using only 10-fold cross validation as mandated by the exercise, the accuracy of these errors is not very high and the errors will randomly vary somewhat from run to run. I’ve also seen runs where it is higher than 0.26.)
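Because cv.glm partitions the data at random, the error estimate varies from run to run. A minimal sketch (not in the original analysis) repeating the cross-validation a few times to characterize that spread, assuming alc, m, and loss_func from above:

```r
library(boot)  # already attached earlier in this report

# Repeat 10-fold cross-validation several times; the mean and standard
# deviation show how stable the error estimate is.
set.seed(1)  # for reproducibility
errors <- replicate(5, cv.glm(data=alc, cost=loss_func, glmfit=m, K=10)$delta[1])
mean(errors)
sd(errors)
```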
Let’s now see how the model error behaves as the number of variables decreases. For this analysis we’ll first use all variables (except high_use, alc_use, Dalc, Walc, probability, and prediction) as explanatory variables, and then remove variables with low significance one at a time. Finally, we will collect the cross-validation errors and plot the change in error as a function of the number of variables removed. We’ll use 29-fold cross-validation to reduce the estimation error while keeping running times reasonable (R warns about K=30; using 29 avoids that warning output in this report).
# These columns are always excluded
always_exclude = c("high_use", "alc_use", "Dalc", "Walc",
"prediction", "probability")
# Dataframe where we collect the errors from each step in simplification
# (this is updated by the step() function)
results = data.frame(count=integer(), error=double(), last=character())
# Estimate a model that excludes the variables given as argument (plus the
# variables that are always excluded), performs cross-validation to estimate
# the model error, and collects the errors into ``results``.
step <- function(additional_excludes) {
excludes <- c(always_exclude, additional_excludes)
cols <- colnames(alc)[!colnames(alc) %in% excludes]
formula_text <- paste("high_use ~ ", paste(cols, collapse=" + "))
formula = as.formula(formula_text)
m <- glm(formula, data=alc, family="binomial")
probabilities <- predict(m, type="response")
alc <- mutate(alc, probability=probabilities)
alc <- mutate(alc, prediction=probability >= 0.5)
cv <- cv.glm(data=alc, cost=loss_func, glmfit=m, K=29)
error <- cv$delta[1] # error
count <- length(additional_excludes)
if (length(additional_excludes) == 0) {
last = "<none>"
} else {
last <- additional_excludes[[length(additional_excludes)]]
}
results <<- rbind(results, data.frame(count, error, last))
return(m)
}
# Perform the elimination steps for the exercise. Starting from the full
# model, each step removes one more variable; the removal order below was
# chosen by picking the variable with the lowest significance in summary(m)
# at each step (e.g., famsup had the lowest significance in the full model).
# We can extend the search simply by appending more variables to the order.
removal_order <- c(
  "famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob",  # note: duplicated by mistake; harmless, since excluded
           # columns are matched with %in%
  "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
  "nursery")
for (i in 0:length(removal_order)) {
  m <- step(removal_order[seq_len(i)])
}
summary(m)
##
## Call:
## glm(formula = formula, data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7470 -0.2983 -0.1224 0.2964 1.0904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.023323 0.105566 0.221 0.82526
## sexM 0.181264 0.041632 4.354 1.72e-05 ***
## famrel -0.077049 0.022620 -3.406 0.00073 ***
## goout 0.135759 0.018450 7.358 1.17e-12 ***
## absences 0.011949 0.002735 4.369 1.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1632952)
##
## Null deviance: 79.162 on 381 degrees of freedom
## Residual deviance: 61.562 on 377 degrees of freedom
## AIC: 398.78
##
## Number of Fisher Scoring iterations: 2
results
## count error last
## 1 0 0.2382199 <none>
## 2 1 0.2460733 famsup
## 3 2 0.2460733 schoolsup
## 4 3 0.2382199 traveltime
## 5 4 0.2460733 Fjob
## 6 5 0.2382199 Fedu
## 7 6 0.2460733 Fjob
## 8 7 0.2329843 famsize
## 9 8 0.2382199 G2
## 10 9 0.2329843 failures
## 11 10 0.2356021 internet
## 12 11 0.2329843 G1
## 13 12 0.2356021 G3
## 14 13 0.2198953 studytime
## 15 14 0.2277487 activities
## 16 15 0.2172775 school
## 17 16 0.2172775 health
## 18 17 0.2146597 Mjob
## 19 18 0.2120419 Medu
## 20 19 0.2068063 freetime
## 21 20 0.2146597 reason
## 22 21 0.2172775 Pstatus
## 23 22 0.2172775 address
## 24 23 0.2146597 age
## 25 24 0.2041885 guardian
## 26 25 0.2172775 romantic
## 27 26 0.2120419 higher
## 28 27 0.2172775 paid
## 29 28 0.2146597 nursery
This leaves us with four variables (goout, absences, sex, and famrel), all of which are statistically highly significant (p < 0.001). The cross-validated prediction error is around 0.21. Just for fun, let's see what happens if we remove even more.
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
"nursery", "famrel"))
summary(m)
##
## Call:
## glm(formula = formula, data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8046 -0.3032 -0.1538 0.3365 1.0327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.263330 0.064620 -4.075 5.61e-05 ***
## sexM 0.172743 0.042136 4.100 5.07e-05 ***
## goout 0.130710 0.018646 7.010 1.10e-11 ***
## absences 0.012497 0.002768 4.514 8.50e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1678754)
##
## Null deviance: 79.162 on 381 degrees of freedom
## Residual deviance: 63.457 on 378 degrees of freedom
## AIC: 408.36
##
## Number of Fisher Scoring iterations: 2
results
## count error last
## 1 0 0.2382199 <none>
## 2 1 0.2460733 famsup
## 3 2 0.2460733 schoolsup
## 4 3 0.2382199 traveltime
## 5 4 0.2460733 Fjob
## 6 5 0.2382199 Fedu
## 7 6 0.2460733 Fjob
## 8 7 0.2329843 famsize
## 9 8 0.2382199 G2
## 10 9 0.2329843 failures
## 11 10 0.2356021 internet
## 12 11 0.2329843 G1
## 13 12 0.2356021 G3
## 14 13 0.2198953 studytime
## 15 14 0.2277487 activities
## 16 15 0.2172775 school
## 17 16 0.2172775 health
## 18 17 0.2146597 Mjob
## 19 18 0.2120419 Medu
## 20 19 0.2068063 freetime
## 21 20 0.2146597 reason
## 22 21 0.2172775 Pstatus
## 23 22 0.2172775 address
## 24 23 0.2146597 age
## 25 24 0.2041885 guardian
## 26 25 0.2172775 romantic
## 27 26 0.2120419 higher
## 28 27 0.2172775 paid
## 29 28 0.2146597 nursery
## 30 29 0.2120419 famrel
It turns out the prediction error did not really change when famrel was removed (even though it was statistically highly significant). Let's see what happens if we also remove sex, the next least significant variable.
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
"Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
"activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
"Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
"nursery", "famrel", "sex"))
summary(m)
##
## Call:
## glm(formula = formula, data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8224 -0.2859 -0.1266 0.3944 1.0561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.191905 0.063512 -3.022 0.00269 **
## goout 0.135835 0.018988 7.154 4.39e-12 ***
## absences 0.011713 0.002819 4.155 4.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1748772)
##
## Null deviance: 79.162 on 381 degrees of freedom
## Residual deviance: 66.278 on 379 degrees of freedom
## AIC: 422.97
##
## Number of Fisher Scoring iterations: 2
results
## count error last
## 1 0 0.2382199 <none>
## 2 1 0.2460733 famsup
## 3 2 0.2460733 schoolsup
## 4 3 0.2382199 traveltime
## 5 4 0.2460733 Fjob
## 6 5 0.2382199 Fedu
## 7 6 0.2460733 Fjob
## 8 7 0.2329843 famsize
## 9 8 0.2382199 G2
## 10 9 0.2329843 failures
## 11 10 0.2356021 internet
## 12 11 0.2329843 G1
## 13 12 0.2356021 G3
## 14 13 0.2198953 studytime
## 15 14 0.2277487 activities
## 16 15 0.2172775 school
## 17 16 0.2172775 health
## 18 17 0.2146597 Mjob
## 19 18 0.2120419 Medu
## 20 19 0.2068063 freetime
## 21 20 0.2146597 reason
## 22 21 0.2172775 Pstatus
## 23 22 0.2172775 address
## 24 23 0.2146597 age
## 25 24 0.2041885 guardian
## 26 25 0.2172775 romantic
## 27 26 0.2120419 higher
## 28 27 0.2172775 paid
## 29 28 0.2146597 nursery
## 30 29 0.2120419 famrel
## 31 30 0.2513089 sex
This time we see a significant increase in error, to the 0.25 level. We thus conclude that goout, absences, and sex are the best combination we can reach with this method. It is, however, possible that we are at a local optimum and that different choices earlier would have led to a better final result, though I think that is unlikely.
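As an aside, this kind of manual backward elimination can also be automated. Base R's stats::step (which the step function defined above happens to mask) performs stepwise selection, though by AIC rather than by cross-validated error. A minimal sketch on simulated data (not the alc data):

```r
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# The outcome depends only on x1 and x2; x3 is pure noise.
d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - d$x2))

full <- glm(y ~ x1 + x2 + x3, data = d, family = "binomial")
# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced <- stats::step(full, direction = "backward", trace = 0)
formula(reduced)  # the noise variable x3 is typically dropped
```

AIC-based selection and CV-error-based selection often agree on which variables to drop, but not always, which is why the exercise does it by hand.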
Let’s finish with a plot of the error as a function of the number of variables removed.
data <- select(results, count, error)
plot(data$count, data$error,
     xlab="variables removed", ylab="error")
title("Prediction error vs. number of variables removed")
Let’s start by loading the libraries we’ll use.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(GGally)
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)
In this exercise we are analyzing the “Boston” dataset from the R MASS library. This example dataset is further described in D. Harrison and D.L. Rubinfeld: Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5:81-102, 1978. The data set relates to the housing values in the suburbs of Boston and several variables that might predict housing prices.
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
There are a total of 506 observations for 14 variables. All variables have numerical values.
ggpairs(Boston, axisLabels="show")
We can see that the distributions of the variables vary quite a bit. Some have a neat near-normal distribution, while others are highly skewed and far from normally distributed.
Some variables are neatly linearly correlated, while others show non-linear relationships (e.g., a "bump" in the middle of the 2-D plot).
We can see from the pairwise plots of the variables that there are significant correlations between many of the variables. In particular, if we look at the median house value (medv, bottom row / rightmost column), we can see that it might have correlations with all or most of the other variables, though in some cases the correlation might not be linear (e.g., age, dis, zn).
Let’s also plot the correlations between the variables to understand the strength of the correlations between each pair of variables:
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
We can see that medv has the highest positive or negative correlations with, e.g., rm (average number of rooms per dwelling) and lstat (percentage of population with lower status), but many other correlations are also significant.
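The same information can be read off numerically by sorting the correlations of medv against the other variables:

```r
library(MASS)
data("Boston")
cor_matrix <- cor(Boston)
# Correlations of every variable with medv
round(sort(cor_matrix[, "medv"], decreasing = TRUE), 2)
```

As expected, lstat (about -0.74) and rm (about 0.70) have the largest magnitudes.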
Let’s then standardize the data set to zero mean and unit standard deviation.
# Standardize the test set
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
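As a quick sanity check that scale() did what we expect, every scaled column should have mean 0 and standard deviation 1 (up to floating point):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
round(colMeans(boston_scaled), 10)   # all essentially 0
apply(boston_scaled, 2, sd)          # all 1 (up to floating point)
```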
# Create a new categorical variable ``crime`` with four levels (low, med_low,
# med_high, high) from the crime rate quantiles, replacing variable ``crim``
bins <- quantile(boston_scaled$crim)
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim,
             breaks=bins,
             labels=labels,
             include.lowest=TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
# Divide the dataset to train and test sets, so that 80% of the data belongs
# to the train set.
cnt <- nrow(boston_scaled)
ind <- sample(cnt, size=cnt * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
str(train)
## 'data.frame': 404 obs. of 14 variables:
## $ zn : num -0.487 0.713 -0.487 -0.487 -0.487 ...
## $ indus : num 1.231 0.569 1.015 -1.202 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num 0.434 -0.783 1.194 -0.947 -0.144 ...
## $ rm : num -0.2614 -0.0507 -0.7652 2.5396 0.5542 ...
## $ age : num 0.868 0.31 1.077 0.264 0.665 ...
## $ dis : num -0.7179 -0.0855 -1.0266 -0.1424 0.2108 ...
## $ rad : num -0.522 -0.637 1.66 -0.867 -0.637 ...
## $ tax : num -0.0311 -0.8202 1.5294 -0.7846 -0.6007 ...
## $ ptratio: num -1.735 -0.118 0.806 -0.21 1.175 ...
## $ black : num -1.276 0.441 0.399 0.441 0.258 ...
## $ lstat : num -0.3981 -0.2889 1.0176 -1.1823 -0.0943 ...
## $ medv : num 0.268 -0.21 -1.526 1.758 -0.167 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 3 1 4 2 3 4 1 4 3 3 ...
str(test)
## 'data.frame': 102 obs. of 14 variables:
## $ zn : num 0.2845 -0.4872 0.0487 0.0487 -0.4872 ...
## $ indus : num -1.287 -0.593 -0.476 -0.476 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.265 -0.265 -0.144 ...
## $ rm : num 0.413 0.194 -0.399 -0.563 -0.478 ...
## $ age : num -0.12 0.367 0.615 -1.051 -0.241 ...
## $ dis : num 0.14 0.557 1.328 0.786 0.433 ...
## $ rad : num -0.982 -0.867 -0.522 -0.522 -0.637 ...
## $ tax : num -0.666 -0.986 -0.577 -0.577 -0.601 ...
## $ ptratio: num -1.458 -0.303 -1.504 -1.504 1.175 ...
## $ black : num 0.441 0.441 0.329 0.371 0.441 ...
## $ lstat : num -1.074 -0.492 0.623 0.428 -0.615 ...
## $ medv : num 0.1595 -0.1014 -0.395 -0.0906 -0.2319 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 2 2 3 3 3 3 3 3 ...
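Note that sample() makes this split random, so the exact rows in the train and test sets differ between runs. For a reproducible split one would fix the random seed first; a sketch (the seed value 123 is arbitrary):

```r
set.seed(123)                         # any fixed seed makes the split reproducible
cnt <- 506                            # number of rows in boston_scaled
ind <- sample(cnt, size = floor(cnt * 0.8))
length(ind)                           # 404 rows go to the training set
anyDuplicated(ind)                    # 0: sampling is without replacement
```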
# Linear discriminant analysis for ``crime`` using all other variables as
# predictor variables
model = lda(crime ~ ., data=train)
model
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2574257 0.2500000 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 1.0014540 -0.8852866 -0.114845058 -0.8824483 0.4420553 -0.9014894
## med_low -0.1351529 -0.2871502 -0.007331936 -0.5593918 -0.1376444 -0.2977610
## med_high -0.3751653 0.1958381 0.234426408 0.4148963 0.0917352 0.4113685
## high -0.4872402 1.0149946 -0.073485621 1.0342837 -0.4118197 0.7941491
## dis rad tax ptratio black lstat
## low 0.9366433 -0.7004968 -0.7174512 -0.44921851 0.37452017 -0.76598741
## med_low 0.3143360 -0.5534047 -0.4728851 -0.04963482 0.31820716 -0.11914919
## med_high -0.3676665 -0.3871700 -0.2887957 -0.31285581 0.04684449 0.04607382
## high -0.8505090 1.6596029 1.5294129 0.80577843 -0.84650110 0.91889508
## medv
## low 0.518771304
## med_low 0.006260481
## med_high 0.152530321
## high -0.726455970
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.13201446 0.721646602 -0.88701385
## indus 0.01624405 -0.198409149 0.10939050
## chas -0.07094906 -0.019113941 0.08272880
## nox 0.22075035 -0.793025647 -1.46405727
## rm -0.07094749 -0.095678126 -0.18468592
## age 0.32737892 -0.303857932 -0.06982634
## dis -0.09801660 -0.258112426 -0.06626246
## rad 3.25270491 0.886620890 -0.12942189
## tax -0.02104907 -0.007491377 0.62487297
## ptratio 0.13377564 0.049845381 -0.24509305
## black -0.15553007 -0.008168146 0.09516348
## lstat 0.17920825 -0.243123788 0.34476706
## medv 0.12841204 -0.410734535 -0.19989112
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9480 0.0385 0.0135
classes = as.numeric(train$crime)
plot(model, dimen=2, col=classes, pch=classes)
From the plot we can see that high is relatively easily separable (except from a few med_high instances), whereas there is more difficulty with the other classes when mapped to two-dimensional space. However, they could still be easily separable in the higher-dimensional original space.
# Save the crime categories from the test set and remove the categorical
# crime variable from the test set
correct_classes = test$crime
test <- dplyr::select(test, -crime)
str(correct_classes)
## Factor w/ 4 levels "low","med_low",..: 1 1 2 2 3 3 3 3 3 3 ...
str(test)
## 'data.frame': 102 obs. of 13 variables:
## $ zn : num 0.2845 -0.4872 0.0487 0.0487 -0.4872 ...
## $ indus : num -1.287 -0.593 -0.476 -0.476 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.265 -0.265 -0.144 ...
## $ rm : num 0.413 0.194 -0.399 -0.563 -0.478 ...
## $ age : num -0.12 0.367 0.615 -1.051 -0.241 ...
## $ dis : num 0.14 0.557 1.328 0.786 0.433 ...
## $ rad : num -0.982 -0.867 -0.522 -0.522 -0.637 ...
## $ tax : num -0.666 -0.986 -0.577 -0.577 -0.601 ...
## $ ptratio: num -1.458 -0.303 -1.504 -1.504 1.175 ...
## $ black : num 0.441 0.441 0.329 0.371 0.441 ...
## $ lstat : num -1.074 -0.492 0.623 0.428 -0.615 ...
## $ medv : num 0.1595 -0.1014 -0.395 -0.0906 -0.2319 ...
# Predict the classes with the LDA model on the test data
predictions = predict(model, newdata=test)
# Cross-tabulate the correct classes vs. predicted classes.
table(correct=correct_classes, predicted=predictions$class)
## predicted
## correct low med_low med_high high
## low 17 8 2 0
## med_low 6 14 2 0
## med_high 0 11 14 0
## high 0 0 1 27
The tabulation shows that the majority of classifications were correct (i.e., they are on the diagonal). However, many observations were classified into neighboring classes. No observation was classified more than two steps away from its correct class.
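The overall accuracy can be computed directly from the table. Here the cell values are copied by hand from the cross-tabulation above:

```r
# Confusion matrix from above: rows = correct class, columns = predicted
conf <- matrix(c(17,  8,  2,  0,
                  6, 14,  2,  0,
                  0, 11, 14,  0,
                  0,  0,  1, 27),
               nrow = 4, byrow = TRUE)
accuracy <- sum(diag(conf)) / sum(conf)
accuracy  # about 0.71 (72 of the 102 test observations correct)
```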
# Reload the original Boston dataset
data("Boston")
# Scale the variables so that all are normalized (zero mean and unit standard
# deviation).
boston_scaled <- as.data.frame(scale(Boston))
# Calculate distances between the (scaled) observations. We use the
# Euclidean distance here (i.e., L2-norm). Are we supposed to use these
# distances somehow? The instructions do not say so.
dist_eu <- dist(boston_scaled, method="euclidean")
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
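As a reminder of what dist() computes: with method="euclidean" each entry is simply the L2 norm of the difference between two rows. A minimal check:

```r
m <- rbind(c(0, 0), c(3, 4))
# Euclidean distance between (0,0) and (3,4) is sqrt(3^2 + 4^2) = 5
as.numeric(dist(m, method = "euclidean"))  # 5
```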
# Run k-means on the dataset.
km <- kmeans(boston_scaled, centers=4)
# Visualize the clustering for a few variable pairs as a sample
pairs(boston_scaled[6:10], col=km$cluster)
# Investigate the optimal number of clusters. The "optimal" value is said to
# be where the WCSS drops radically. (In real applications with overlapping
# clusters it is often not quite this simple.)
k_max = 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# Plot the WCSS as a function of the number of clusters.
qplot(x=1:k_max, y=twcss, geom="line")
We can see from the plot that the WCSS drops sharply between 1 and 2 clusters (with smaller decreases as more clusters are added). Based on this, we pick 2 as the optimal number of clusters and run k-means again with that number. It would not be unreasonable to choose a higher number of clusters, but there are no further dramatic drops at higher cluster counts.
# Run k-means again with the "optimal" number of clusters
km <- kmeans(boston_scaled, centers=2)
# Visualize the clusters (the "optimal" number of them)
pairs(boston_scaled, col=km$cluster)
# Given that visualizing all pairs is nearly unreadable, visualize a small
# sample of variables. We could use select() to pick a list of named
# variables instead, but the instructions don't require it.
pairs(boston_scaled[6:10], col=km$cluster)
Looking at the visualized variables, we can see that some pairs of variables separate the clusters pretty well. For example, tax, rad, and dis all seem to offer relatively good separation.
# Reload the original Boston dataset
data("Boston")
# Scale the variables so that all are normalized (zero mean and unit standard
# deviation).
boston_scaled <- as.data.frame(scale(Boston))
# Cluster into 3 clusters using k-means
km <- kmeans(boston_scaled, centers=3)
# Perform LDA using the clusters as target classes
lda.fit <- lda(x=boston_scaled, grouping=km$cluster)
# print the lda.fit object
lda.fit
## Call:
## lda(boston_scaled, grouping = km$cluster)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2786561 0.4130435 0.3083004
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3989734 1.1241495 -0.9749470 -0.04894749 -0.8314162 0.9328502
## 2 -0.3763256 -0.3947158 -0.1595373 0.01023793 -0.2992872 -0.2772450
## 3 0.8647905 -0.4872402 1.0949412 0.03052480 1.1524404 -0.4717159
## age dis rad tax ptratio black
## 1 -0.90687091 0.8988453 -0.5892747 -0.6843385 -0.77780359 0.3568316
## 2 0.02424663 0.0304962 -0.5856776 -0.5482746 0.09789185 0.2769803
## 3 0.78718751 -0.8532750 1.3172714 1.3530841 0.57186480 -0.6936034
## lstat medv
## 1 -0.92612124 0.9888052
## 2 -0.03824517 -0.1525634
## 3 0.88830984 -0.6893319
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim -0.03223565 0.18654924
## zn -0.04675240 0.87238302
## indus 0.40550042 -0.16163062
## chas 0.08979225 -0.11703465
## nox 1.05432570 0.68362018
## rm -0.26923690 0.50640378
## age -0.21593250 -0.45834423
## dis 0.06906243 0.34138098
## rad 1.15943286 0.38841914
## tax 0.99119613 0.47397797
## ptratio 0.07545578 -0.16725670
## black -0.07369899 -0.03242015
## lstat 0.26131112 0.32118017
## medv -0.03902737 0.67098685
##
## Proportion of trace:
## LD1 LD2
## 0.8724 0.1276
# Visualize the results using a biplot with arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col="black", pos=3)
}
# Plot the LDA results using two dimensions, with arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 1)
# Plot again with much bigger scale so that we can read the labels on the
# smaller arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 5)
# Plot again with even bigger scale so that we can read the labels on the
# smaller arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 15)
The length of each arrow reflects the direction of the corresponding variable relative to the plane chosen for visualizing the clusters: the longer the arrow, the more parallel the variable is to the plane; the shorter, the more nearly perpendicular it is.
It looks like the plane that R chose for visualization (which seems to separate the clusters reasonably well) is most parallel to rad (index of accessibility to radial highways), second most parallel to tax (full-value property-tax rate), and third most parallel to age (proportion of owner-occupied units built prior to 1940).
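This can be checked numerically: the arrow lengths are just the (LD1, LD2) norms of the rows of the coefficient matrix. A self-contained version (note that kmeans starts from random centers, so I fix a seed here; the exact lengths and ordering can differ from the run above):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
set.seed(1)  # kmeans is random; fix a seed for reproducibility
km <- kmeans(boston_scaled, centers = 3)
fit <- lda(x = boston_scaled, grouping = km$cluster)
heads <- coef(fit)[, 1:2]
# Length of each variable's arrow in the (LD1, LD2) plane, longest first
sort(sqrt(rowSums(heads^2)), decreasing = TRUE)
```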
# Save gold standard classes from the training data
train_correct <- train$crime
# The earlier lda.fit (from the clustering section) was trained on 14
# variables, which does not match the 13 predictors in the train set, so
# recompute the LDA model from the training data
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 14 2
model <- lda(crime ~ ., data=train)
# Map the datapoints into the visualization space
matrix_product <- as.matrix(model_predictors) %*% model$scaling
matrix_product <- as.data.frame(matrix_product)
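One caveat with this raw matrix product: MASS's own predict() centers the data before projecting, so the coordinates computed here match predict()$x only up to a constant shift along each LD axis (which does not change the shape of the plot). A self-contained check on the iris data:

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])
raw <- X %*% fit$scaling   # uncentered projection, as in the code above
p <- predict(fit)$x        # MASS's centered projection
shift <- raw - p
# The difference is identical for every observation, i.e. a constant shift:
max(abs(sweep(shift, 2, shift[1, ])))  # essentially 0
```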
# Create a 3D plot with color indicating the gold standard crime
# classes in the train set
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(matrix_product, x=~LD1, y=~LD2, z=~LD3,
type="scatter3d", mode="markers",
color=train_correct, marker=list(size=2))
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Create a 3D plot with color indicating the predicted crime classes. Since
# the gold-standard data has four classes, we cluster into four classes.
train_no_crime <- dplyr::select(train, -crime)
km <- kmeans(train_no_crime, centers=4)
plot_ly(matrix_product, x=~LD1, y=~LD2, z=~LD3,
type="scatter3d", mode="markers",
color=km$cluster, marker=list(size=2))
Both plots contain the same training points, so their locations in the visualization space are identical. We can see that the green, orange, and blue classes in the original data overlap quite a bit and are not well separated. The clustering, however, produces much sharper class boundaries. This is as expected, because clustering assigns each point based on its distance to the cluster centers. While several dimensions are not shown in these plots, one could expect them to partially reflect the distance from the cluster centers, which is consistent with what we see in the plot of the predicted classes. Except for the sharpening of class boundaries and the different color scheme, the plots are fairly similar.
The data wrangling script that was part of Exercise 4 can be found in my github repository in the data directory, here.
Thoughts after the first week:
git pull and git push did not work properly with SSH keys (they asked for a git username and password). I had to tweak the configuration manually to make SSH keys work correctly.
Thoughts for the second week:
Thoughts after the third week:
Thoughts after the fourth week:
(more chapters to be added similarly as we proceed with the course!)